---
title: "Example of Text Classification"
subtitle: "The Temporal Focus of Campaign Communication (The Journal of Politics)"
author: "Stefan Müller (muellerstefan.net)"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

This document shows how to apply the English classifier of temporal focus (past/present/future) to a _quanteda_ text corpus. Note that the training set consists of sentences from party manifestos, whereas the non-annotated corpus in this example consists of legislative speeches from Ireland. **Thorough validation is required after the classifier has been applied to a different type of political text.**

In this example, I use `data_corpus_irishbudget2010`, reshape the corpus to the level of sentences, train a Support Vector Machine on the classified manifesto sentences, and predict the class of all sentences from the speeches. I also return the 10 sentences with the highest predicted probabilities of "past", "present", and "future" as a quick check of the face validity of the classification.

Please contact the author of the paper if you have any questions: stefan.mueller@ucd.ie

### Load Packages and Data

```{r, message=FALSE, warning=FALSE}
# load packages
library(quanteda)            # CRAN v2.0.1
library(quanteda.textmodels) # CRAN v0.9.1
library(dplyr)               # CRAN v0.8.5
library(kableExtra)          # CRAN v1.1.0

# load English classified sentences 
# (included in the JOP Dataverse folder)
dat_classified_english <- readRDS("data_sentences_classified_english.rds")
```

### Transform Training Data into a Document-Feature Matrix

```{r}
dfmat_coded_en <- dat_classified_english %>% 
    corpus() %>% 
    dfm()
```

### Train Support Vector Machine

```{r}
# use textmodel_svm() (from quanteda.textmodels)

# Note: future versions of the package might include an
# updated implementation of textmodel_svm();
# please use v0.9.1 to reproduce the classification

# first argument: dfm of the hand-coded sentences
# second argument: the hand-coded class of each sentence
tmod_en_svm_class <- textmodel_svm(dfmat_coded_en,
                                   docvars(dfmat_coded_en, "class"))
```

### Prepare Non-Annotated Text Corpus

```{r}
# reshape data_corpus_irishbudget2010 to level of sentences
corp_ire <- corpus_reshape(data_corpus_irishbudget2010, to = "sentences")

# construct a dfm without any preprocessing
dfmat_ire <- dfm(corp_ire)
```
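
The classifier can only use features that also occurred in the training data, so the new dfm has to be brought onto the training vocabulary at some point. Depending on the package version, `predict()` may handle this alignment itself; the following minimal sketch only makes the step visible. It uses `dfm_match()` from _quanteda_, and the two toy sentences and all `toy_*` object names are hypothetical.

```{r}
library(quanteda)

# hypothetical toy "training" sentences (not from the replication data)
toy_train <- dfm(tokens(c(d1 = "we cut taxes last year",
                          d2 = "we will invest next year")))

# hypothetical new sentence containing unseen words ("build", "schools")
toy_new <- dfm(tokens(c(n1 = "next year we will build schools")))

# keep only training features, in training order;
# training features missing from the new text are padded with zeros
toy_new_matched <- dfm_match(toy_new, features = featnames(toy_train))

# the matched dfm now has exactly the training vocabulary
identical(featnames(toy_new_matched), featnames(toy_train))
```

Words that appear only in the speeches are dropped by this step, which is one reason why a classifier trained on manifestos needs validation on other text types.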

### Predict Class of Sentence-Level Speeches

```{r}
# predict class of each sentence and return the probability for each class
pred_ire_prob <- predict(tmod_en_svm_class, dfmat_ire, type = "probability")

# predict class of each sentence and return the class with the highest probability
pred_ire_class <- predict(tmod_en_svm_class, dfmat_ire, type = "class")

# get overview of classes in sentences from the Irish budget speeches
table(pred_ire_class)
```
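
To illustrate how the two prediction types relate, the following base R sketch derives a class from a matrix of class probabilities by picking the column with the highest value in each row, and then summarises the class shares. The probabilities are made up for illustration, and the `toy_*` names are hypothetical; the column names assume the labels "Past", "Present", and "Future".

```{r}
# hypothetical class probabilities for three sentences
toy_prob <- matrix(c(0.7, 0.2, 0.1,
                     0.1, 0.3, 0.6,
                     0.2, 0.5, 0.3),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("s1", "s2", "s3"),
                                   c("Past", "Present", "Future")))

# class with the highest probability in each row
toy_class <- colnames(toy_prob)[max.col(toy_prob, ties.method = "first")]
toy_class

# share of each class among the toy sentences
prop.table(table(toy_class))
```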

### Face Validity: Sentences with the Highest Probabilities of "Past", "Present", and "Future"

```{r}
# bind texts and classification probabilities to a single data frame
# (texts() works in quanteda v2.0.1; it is deprecated from quanteda v3 onwards)
dat <- data.frame(
    text = texts(corp_ire),
    pred_ire_prob
)
```
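
The three sections that follow repeat the same sort-and-truncate step for each class. As a sketch, that step could be wrapped in a small base R helper. The function `top_sentences()` and the toy data frame are hypothetical and not part of the replication materials.

```{r}
# hypothetical helper: return the n rows with the highest value in class_col
top_sentences <- function(data, class_col, n = 10) {
    ordered <- data[order(data[[class_col]], decreasing = TRUE), , drop = FALSE]
    head(ordered, n = n)
}

# hypothetical toy data with the same layout as `dat`
toy_dat <- data.frame(
    text = c("sentence a", "sentence b", "sentence c"),
    Past = c(0.1, 0.8, 0.3),
    Present = c(0.6, 0.1, 0.3),
    Future = c(0.3, 0.1, 0.4),
    stringsAsFactors = FALSE
)

# two sentences most likely to address the past
top_sentences(toy_dat, "Past", n = 2)
```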

### Past

```{r}
# sentences with the highest probabilities of addressing the past
dat %>% 
    arrange(-Past) %>% 
    head(n = 10) %>% 
    kable(digits = 2) %>% 
    kable_styling(bootstrap_options = c("striped", "hover"))
```

### Present

```{r}
# sentences with the highest probabilities of addressing the present
dat %>% 
    arrange(-Present) %>% 
    head(n = 10) %>% 
    kable(digits = 2) %>% 
    kable_styling(bootstrap_options = c("striped", "hover"))
```

### Future

```{r}
# sentences with the highest probabilities of addressing the future
dat %>% 
    arrange(-Future) %>% 
    head(n = 10) %>% 
    kable(digits = 2) %>% 
    kable_styling(bootstrap_options = c("striped", "hover"))
```
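
Face validity is only a first check. The validation called for in the introduction would typically compare the predictions with hand-coded labels for a sample of sentences from the new corpus. The following base R sketch shows one way to compute a confusion matrix, accuracy, and class-wise precision and recall. Both label vectors are invented for illustration, and all object names are hypothetical.

```{r}
classes <- c("Past", "Present", "Future")

# hypothetical hand-coded labels for eight sentences
toy_gold <- factor(c("Past", "Past", "Present", "Present",
                     "Present", "Future", "Future", "Future"),
                   levels = classes)

# hypothetical classifier predictions for the same sentences
toy_pred <- factor(c("Past", "Present", "Present", "Present",
                     "Future", "Future", "Future", "Past"),
                   levels = classes)

# rows: predicted class, columns: hand-coded class
conf_mat <- table(predicted = toy_pred, actual = toy_gold)
conf_mat

# share of sentences classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# precision: correct predictions / all predictions of a class
precision <- diag(conf_mat) / rowSums(conf_mat)

# recall: correct predictions / all hand-coded sentences of a class
recall <- diag(conf_mat) / colSums(conf_mat)

round(rbind(precision, recall), 2)
```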
